Abstract

We construct efficient data structures that are resilient against a constant fraction of adversarial noise. 
Our model requires that the decoder answers most queries correctly with high probability and for the re- 
maining queries, the decoder with high probability either answers correctly or declares "don't know." 
Furthermore, if there is no noise on the data structure, it answers all queries correctly with high prob- 
ability. Our model is the common generalization of an error-correcting data structure model proposed 
recently by de Wolf, and the notion of "relaxed locally decodable codes" developed in the PCP literature. 

We measure the efficiency of a data structure in terms of its length (the number of bits in its
representation) and its query-answering time, measured by the number of bit-probes to the (possibly
corrupted) representation. We obtain results for the following two data structure problems:

• (Membership) Store a subset S of size at most s from a universe of size n such that membership 
queries can be answered efficiently, i.e., decide if a given element from the universe is in S. 

We construct an error-correcting data structure for this problem with length nearly linear in $s \log n$
that answers membership queries with $O(1)$ bit-probes. This nearly matches the asymptotically
optimal parameters for the noiseless case: length $O(s \log n)$ and one bit-probe, due to Buhrman,
Miltersen, Radhakrishnan, and Venkatesh.

• (Univariate polynomial evaluation) Store a univariate polynomial $g$ of degree $\deg(g) \le s$ over the
integers modulo $n$ such that evaluation queries can be answered efficiently, i.e., evaluate the output
of $g$ on a given integer modulo $n$.

We construct an error-correcting data structure for this problem with length nearly linear in $s \log n$
that answers evaluation queries with $\mathrm{polylog}\, s \cdot \log^{1+o(1)} n$ bit-probes. This nearly matches the
parameters of the best-known noiseless construction, due to Kedlaya and Umans.
1 Introduction



The area of data structures is one of the oldest and most fundamental parts of computer science, in theory 
as well as in practice. The underlying question is a time-space tradeoff: we are given a piece of data, and 
we would like to store it in a short, space-efficient data structure that allows us to quickly answer specific 
queries about the stored data. On one extreme, we can store the data as just a list of the correct answers to all 
possible queries. This is extremely time-efficient (one can immediately look up the correct answer without
doing any computation) but usually takes significantly more space than the information-theoretic minimum. 
At the other extreme, we can store a maximally compressed version of the data. This method is extremely 
space-efficient but not very time-efficient since one usually has to undo the whole compression first. A good 
data structure sits somewhere in the middle: it does not use much more space than the information-theoretic 
minimum, but it also stores the data in a structured way that enables efficient query-answering. 

It is reasonable to assume that most practical implementations of data storage are susceptible to noise: 
over time some of the information in the data structure may be corrupted or erased by various accidental or 
malicious causes. This buildup of errors may cause the data structure to deteriorate so that most queries are 
not answered correctly anymore. Accordingly, it is a natural task to design data structures that are not only 
efficient in space and time but also resilient against a certain amount of adversarial noise, where the noise 
can be placed in positions that make decoding as difficult as possible. 

Ways to protect information and computation against noise have been well studied in the theory of
error-correcting codes and of fault-tolerant computation. In the data structure literature, constructions under
often incomparable models have been designed to cope with noise, and we examine a few of these models.
Aumann and Bender [1] studied pointer-based data structures such as linked lists, stacks, and binary search
trees. In this model, errors (adversarial but detectable) occur whenever all the pointers from a node are
lost. They measured the dependency between the number of errors and the number of nodes that become
irretrievable, and designed a number of efficient data structures where this dependency is reasonable.

Another model for studying data structures with noise is the faulty-memory RAM model, introduced
by Finocchi and Italiano [10]. In a faulty-memory RAM, there are $O(1)$ memory cells that cannot be
corrupted by noise. Elsewhere, errors (adversarial and undetectable) may occur at any time, even during the
decoding procedure. Many data structure problems have been examined in this model, such as sorting [8],
searching [9], priority queues [13], and dictionaries [5]. However, the number of errors that can be tolerated
is typically less than a linear portion of the size of the input. Furthermore, correctness can only be guaranteed
for keys that are not affected by noise. For instance, for the problem of comparison-sorting on $n$ keys, the
authors of [8] designed a resilient sorting algorithm that tolerates $\sqrt{n \log n}$ keys being corrupted and ensures
that the set of uncorrupted keys remains sorted.

Recently, de Wolf [20] considered another model of resilient data structures. The representation of the
data structure is viewed as a bit-string, from which a decoding procedure can read any particular set of
bits to answer a data query. The representation must be able to tolerate a constant fraction $\delta$ of adversarial
noise in the bit-string¹ (but not inside the decoding procedure). His model generalizes the usual noise-free
data structures (where $\delta = 0$) as well as the so-called "locally decodable codes" (LDCs) [14]. Informally,
an LDC is an encoding that is tolerant of noise and allows fast decoding so that each message symbol 
can be retrieved correctly with high probability. Using LDCs as building blocks, de Wolf constructed data 
structures for several problems. 

Unfortunately, de Wolf's model has the drawback that the optimal time-space tradeoffs are much worse
than in the noise-free model. The reason is that all known constructions of LDCs that make $O(1)$ bit-probes
[22, 7] have very poor encoding length (super-polynomial in the message length). In fact, the encoding
length provably must be super-linear in the message length [14, 16, 21]. As his model is a generalization
of LDCs, data structures in his model cannot have a succinct representation that has length proportional
to the information-theoretic bound.

¹ We only consider bit-flip-errors here, not erasures. Since erasures are easier to deal with than bit-flips, it suffices to design a
data structure dealing with bit-flip-errors.

We thus ask: what is a clean model of data structures that allows efficient representations and has
error-correcting capabilities? Compared with the pointer-based model and the faulty-memory RAM, de Wolf's
model imposes a rather stringent requirement on decoding: every query must be answered correctly with 
high probability from the possibly corrupted encoding. While this requirement is crucial in the definition of 
LDCs due to their connection to complexity theory and cryptography, for data structures it seems somewhat 
restrictive. 

In this paper, we consider a broader, more relaxed notion of error-correction for data structures. In our
model, for most queries, the decoder has to return the correct answer with high probability. However, for
the few remaining queries, the decoder may claim ignorance, i.e., declare the data item unrecoverable from
the (corrupted) data structure. Still, for every query, the answer is incorrect only with small probability.
In fact, just as de Wolf's model is a generalization of LDCs, our model in this paper is a generalization
of the "relaxed" locally decodable codes (RLDCs) introduced by Ben-Sasson, Goldreich, Harsha, Sudan,
and Vadhan [4]. They relax the usual definition of an LDC by requiring the decoder to return the correct
answer on most rather than all queries. For the remaining queries it is allowed to claim ignorance, i.e., to
output a special symbol '⊥' interpreted as "don't know" or "unrecoverable." As shown in [4], relaxing the
LDC-definition like this allows for constructions of RLDCs with $O(1)$ bit-probes of nearly linear length.

Using RLDCs as building blocks, we construct error-correcting data structures that are very efficient in
terms of time as well as space. Before we describe our results, let us define our model formally. First, a data
structure problem is specified by a set $D$ of data items, a set $Q$ of queries, a set $A$ of answers, and a function
$f : D \times Q \to A$ which specifies the correct answer $f(x, q)$ of query $q$ to data item $x$. A data structure for
$f$ is specified by four parameters: $t$ the number of bit-probes, $\delta$ the fraction of noise, $\varepsilon$ an upper bound on the
error probability for each query, and $\lambda$ an upper bound on the fraction of queries in $Q$ that are not answered
correctly with high probability (the '$\lambda$' stands for "lost").

Definition 1. Let $f : D \times Q \to A$ be a data structure problem. Let $t > 0$ be an integer, $\delta \in [0, 1]$,
$\varepsilon \in [0, 1/2]$, and $\lambda \in [0, 1]$. We say that $f$ has a $(t, \delta, \varepsilon, \lambda)$-data structure of length $N$ if there exist an
encoder $\mathcal{E} : D \to \{0,1\}^N$ and a (randomized) decoder $\mathcal{D}$ with the following properties: for every $x \in D$
and every $w \in \{0,1\}^N$ at Hamming distance $\Delta(w, \mathcal{E}(x)) \le \delta N$,

1. $\mathcal{D}$ makes at most $t$ bit-probes to $w$,

2. $\Pr[\mathcal{D}^w(q) \in \{f(x, q), \perp\}] \ge 1 - \varepsilon$ for every $q \in Q$,

3. the set $G = \{q : \Pr[\mathcal{D}^w(q) = f(x, q)] \ge 1 - \varepsilon\}$ has size at least $(1 - \lambda)|Q|$ ('$G$' stands for "good"),

4. if $w = \mathcal{E}(x)$, then $G = Q$.

Here $\mathcal{D}^w(q)$ denotes the random variable which is the decoder's output on inputs $w$ and $q$. The notation
indicates that it accesses the two inputs in different ways: while it has full access to the query $q$, it only has
bit-probe access (or "oracle access") to the string $w$.
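The access model can be illustrated with a minimal Python sketch. The names `ProbeOracle`, `encode_identity`, and `decode_identity` are hypothetical and not from the paper. The example is the trivial noise-free case: storing $x$ itself gives a $(1, 0, 0, 0)$-data structure for $f(x, i) = x_i$. It has no error-correcting ability and only shows how the decoder sees the query in full but reads $w$ one counted bit at a time.

```python
class ProbeOracle:
    """Bit-probe ("oracle") access to a word w: only single-bit reads, which are counted."""

    def __init__(self, word):
        self._word = list(word)
        self.probes = 0  # number of bit-probes made so far

    def probe(self, position):
        self.probes += 1
        return self._word[position]


def encode_identity(x):
    """Encoder E(x) = x for the problem f(x, i) = x_i (no redundancy, so delta = 0)."""
    return list(x)


def decode_identity(oracle, query):
    """Decoder D^w(q): sees the query directly, but reads w only through the oracle."""
    return oracle.probe(query)
```

Each call to `decode_identity` makes exactly one probe, so $t = 1$. Tolerating noise ($\delta > 0$) requires a redundant encoder in place of `encode_identity`.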

We say that a $(t, \delta, \varepsilon, \lambda)$-data structure is error-correcting, or an error-correcting data structure, if $\delta > 0$.
Setting $\lambda = 0$ recovers the original notion of error-correction in de Wolf's model [20]. A $(t, \delta, \varepsilon, \lambda)$-relaxed
locally decodable code (RLDC), defined in [4], is an error-correcting data structure for the membership
function $f : \{0,1\}^n \times [n] \to \{0,1\}$, where $f(x, i) = x_i$. A $(t, \delta, \varepsilon)$-locally decodable code (LDC), defined
by Katz and Trevisan [14], is an RLDC with $\lambda = 0$.
Remark. For the data structure problems considered in this paper, our decoding procedures make only
non-adaptive probes, i.e., the positions of the probes are determined all at once and sent simultaneously to the
oracle. For other data structure problems it may be natural for decoding procedures to be adaptive. Thus,
we do not require $\mathcal{D}$ to be non-adaptive in Condition 1 of Definition 1.

1.1 Our results 

We obtain efficient error-correcting data structures for the following two data structure problems.

Membership: Consider a universe $[n] = \{1, \ldots, n\}$ and some nonnegative integer $s \le n$. Given a set
$S \subseteq [n]$ with at most $s$ elements, one would like to store $S$ in a compact representation that can answer
"membership queries" efficiently, i.e., given an index $i \in [n]$, determine whether or not $i \in S$. Formally
$D = \{S : S \subseteq [n], |S| \le s\}$, $Q = [n]$, and $A = \{0, 1\}$. The function $\mathrm{MEM}_{n,s}(S, i)$ is 1 if $i \in S$ and 0
otherwise.

Since there are at least $\binom{n}{s}$ subsets of the universe of size at most $s$, each subset requiring a different
instantiation of the data structure, the information-theoretic lower bound on the space of any data structure
is at least $\log \binom{n}{s} \approx s \log n$ bits.² An easy way to achieve this is to store $S$ in sorted order. If each number
is stored in its own $\log n$-bit "cell," this data structure takes $s$ cells, which is $s \log n$ bits. To answer a
membership query, one can do a binary search on the list to determine whether $i \in S$ using about $\log s$
"cell-probes," or $\log s \cdot \log n$ bit-probes. The length of this data structure is essentially optimal, but its
number of probes is not. Fredman, Komlós, and Szemerédi [11] developed a famous hashing-based data
structure that has length $O(s)$ cells (which is $O(s \log n)$ bits) and only needs a constant number of
cell-probes (which is $O(\log n)$ bit-probes). Buhrman, Miltersen, Radhakrishnan, and Venkatesh [6] improved
upon this by designing a data structure of length $O(s \log n)$ bits that answers queries with only one bit-probe
and a small error probability. This is simultaneously optimal in terms of time (clearly one bit-probe cannot
be improved upon) and space (up to a constant factor).
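The sorted-list baseline mentioned above can be sketched in a few lines of Python. The function names are illustrative only. The sketch counts cell-probes explicitly, so the roughly $\log s$ cell-probe cost, and the corresponding $\log s \cdot \log n$ bit-probe cost, can be read off.

```python
def build_sorted_cells(subset):
    """Store S in sorted order, one element per cell."""
    return sorted(subset)


def cell_bits(n):
    """Bits per cell when every element of [n] = {1, ..., n} gets its own cell."""
    return max(1, (n - 1).bit_length())


def member(cells, i):
    """Binary search for i; returns (answer, number_of_cell_probes)."""
    lo, hi, probes = 0, len(cells), 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1  # reading cells[mid] costs one cell-probe
        value = cells[mid]
        if value == i:
            return 1, probes
        if value < i:
            lo = mid + 1
        else:
            hi = mid
    return 0, probes
```

Multiplying the returned probe count by `cell_bits(n)` gives the number of bit-probes. None of this is noise-tolerant: a single corrupted cell can misdirect the search.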

None of the aforementioned data structures can tolerate a constant fraction of noise. To protect against
noise for this problem, de Wolf [20] constructed an error-correcting data structure with $\lambda = 0$ using a
locally decodable code (LDC). That construction answers membership queries in $t$ bit-probes and has length
roughly $L(s, t) \log n$, where $L(s, t)$ is the shortest length of an LDC encoding $s$ bits with bit-probe
complexity $t$. Currently, all known LDCs with $t = O(1)$ have $L(s, t)$ super-polynomial in $s$ [3, 22, 7]. In fact,
$L(s, t)$ must be super-linear for all constant $t$, see e.g. [14, 16, 21].

Under our present model of error-correction, we can construct much more efficient data structures with
error-correcting capability. First, it is not hard to show that by composing the BMRV data structure [6]
with the error-correcting data structure for $\mathrm{MEM}_{n,n}$ (equivalently, an RLDC) [4], one can already obtain an
error-correcting data structure of length $O((s \log n)^{1+\eta})$, where $\eta$ is an arbitrarily small constant. However,
following an approach taken in [20], we obtain a data structure of length $O(s^{1+\eta} \log n)$, which is much
shorter than the aforementioned construction if $s = o(\log n)$.

Theorem 1. For every $\varepsilon, \eta \in (0, 1)$, there exist an integer $t > 0$ and real $\tau > 0$, such that for all $s$ and $n$,
and every $\delta \le \tau$, $\mathrm{MEM}_{n,s}$ has a $(t, \delta, \varepsilon, \frac{s}{2n})$-data structure of length $O(s^{1+\eta} \log n)$.

We will prove Theorem 1 in Section 2. Note that the size of the good set $G$ is at least $n - \frac{s}{2}$. Hence
corrupting a $\delta$-fraction of the bits of the data structure may cause a decoding failure for at most half of the
queries $i \in S$ but not all. One may replace this factor $\frac{1}{2}$ easily by another constant (though the parameters $t$
and $\tau$ will then change).

² Our logs are always to base 2.
Polynomial evaluation: Let $\mathbb{Z}_n$ denote the set of integers modulo $n$ and $s \le n$ be some nonnegative
integer. Given a univariate polynomial $g \in \mathbb{Z}_n[X]$ of degree at most $s$, we would like to store $g$ in a compact
representation so that for each evaluation query $a \in \mathbb{Z}_n$, $g(a)$ can be computed efficiently. Formally,
$D = \{g : g \in \mathbb{Z}_n[X], \deg(g) \le s\}$, $Q = \mathbb{Z}_n$, and $A = \mathbb{Z}_n$, and the function is $\mathrm{POLYEVAL}_{n,s}(g, a) = g(a)$.

Since there are $n^{s+1}$ polynomials of degree at most $s$, with each polynomial requiring a different
instantiation of the data structure, the information-theoretic lower bound on the space of any data structure for this
problem is at least $\log(n^{s+1}) \approx s \log n$ bits. Since each answer is an element of $\mathbb{Z}_n$ and must be represented
by $\lfloor \log n \rfloor + 1$ bits, $\lfloor \log n \rfloor + 1$ is the information-theoretic lower bound on the bit-probe complexity.

Consider the following two naive solutions. On one hand, one can simply record the evaluations of $g$
in a table with $n$ entries, each with $\lfloor \log n \rfloor + 1$ bits. The length of this data structure is $O(n \log n)$ and
each query requires reading only $\lfloor \log n \rfloor + 1$ bits. On the other hand, $g$ can be stored as a table of its $s + 1$
coefficients. This gives a data structure of length and bit-probe complexity $(s + 1)(\lfloor \log n \rfloor + 1)$.
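The two naive solutions can be sketched in Python as follows. The function names are illustrative, and coefficients are assumed to be listed from lowest to highest degree. The first solution trades space for time with a table of $n$ entries. The second stores only the $s + 1$ coefficients and reads all of them on every query.

```python
def horner(coeffs, a, n):
    """Naive solution 2: evaluate g(a) mod n by reading all s + 1 coefficients."""
    result = 0
    for c in reversed(coeffs):
        result = (result * a + c) % n
    return result


def evaluation_table(coeffs, n):
    """Naive solution 1: precompute g(a) mod n for every a in Z_n (n entries)."""
    return [horner(coeffs, a, n) for a in range(n)]


def bits_per_entry(n):
    """floor(log2 n) + 1 bits, the size of one table entry or coefficient (n >= 1)."""
    return n.bit_length()
```

With these helpers, the table has `n * bits_per_entry(n)` bits and a query reads one entry, while the coefficient list has `len(coeffs) * bits_per_entry(n)` bits and a query reads all of them.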

A natural question is whether one can construct a data structure that is optimal both in terms of space
and time, i.e., has length $O(s \log n)$ and answers queries with $O(\log n)$ bit-probes. No such constructions
are known to exist. However, some lower bounds are known in the weaker cell-probe model, where each
cell is a sequence of $\lfloor \log n \rfloor + 1$ bits. For instance, as noted in [15], any data structure for POLYNOMIAL
EVALUATION that stores $O(s^2)$ cells ($O(s^2 \log n)$ bits) requires reading at least $\Omega(s)$ cells ($\Omega(s \log n)$ bits).
Moreover, by [17], if $\log n \gg s \log s$ and the data structure is constrained to store $s^{O(1)}$ cells, then its query
complexity is $\Omega(s)$ cells. This implies that the second trivial construction described above is essentially
optimal in the cell-probe model.

Recently, Kedlaya and Umans [15] obtained a data structure of length $s^{1+\eta} \log^{1+o(1)} n$ (where $\eta$ is an
arbitrarily small constant) that answers evaluation queries with $O(\mathrm{polylog}\, s \cdot \log^{1+o(1)} n)$ bit-probes. These
parameters exhibit the best tradeoff between $s$ and $n$ so far. When $s = n^{\eta}$ for some $0 < \eta < 1$, the data
structure of Kedlaya and Umans [15] is much superior to the trivial solution: its length is nearly optimal,
and the query complexity drops from $\mathrm{poly}\, n$ to only $\mathrm{polylog}\, n$ bit-probes.

Here we construct an error-correcting data structure for the polynomial evaluation problem that works
even in the presence of adversarial noise, with length nearly linear in $s \log n$ and bit-probe complexity
$O(\mathrm{polylog}\, s \cdot \log^{1+o(1)} n)$. Formally:

Theorem 2. For every $\varepsilon, \lambda, \eta \in (0, 1)$, there exists $\tau \in (0, 1)$ such that for all positive integers $s \le n$, for
all $\delta \le \tau$, the data structure problem $\mathrm{POLYEVAL}_{n,s}$ has a $(O(\mathrm{polylog}\, s \cdot \log^{1+o(1)} n), \delta, \varepsilon, \lambda)$-data structure
of length $O((s \log n)^{1+\eta})$.

Remark. We note that Theorem 2 easily holds when $s = (\log n)^{o(1)}$. As we discussed previously, one can
just store a table of the $s + 1$ coefficients of $g$. To make this error-correcting, encode the entire table by a
standard error-correcting code. This has length and bit-probe complexity $O(s \log n) = O(\log^{1+o(1)} n)$.

1.2 Our techniques 

At a high level, for both data structure problems we build our constructions by composing a relaxed locally 
decodable code with an appropriate noiseless data structure. If the underlying probe-accessing scheme in a 
noiseless data structure is "pseudorandom," then the noiseless data structure can be made error-correcting by 
appropriate compositions with other data structures. By pseudorandom, we mean that if a query is chosen 
uniformly at random from Q, then the positions of the probes selected also "behave" as if they are chosen 
uniformly at random. Such a property allows us to analyze the error-tolerance of our constructions.

More specifically, for the MEMBERSHIP problem we build upon the noiseless data structure of Buhrman
et al. [6]. While de Wolf [20] combined this with LDCs to get a rather long data structure with $\lambda = 0$, we
will combine it here with RLDCs to get nearly optimal length with small (but non-zero) $\lambda$. In order to bound
$\lambda$ in our new construction, we make use of the fact that the [6]-construction is a bipartite expander graph,
as explained below after Theorem 4. This property wasn't needed in [20]. The left side of the expander
represents the set of queries, and a neighborhood of a query (a left node) represents the set of possible
bit-probes that can be chosen to answer this query. The expansion property of the graph essentially implies that
for a random query, the distribution of a bit-probe chosen to answer this query is close to uniform.³ This
property allows us to construct an efficient, error-correcting data structure for this problem.

For the polynomial evaluation problem, we rely upon the noiseless data structure of Kedlaya and
Umans [15], which has a decoding procedure that uses the reconstructive algorithm from the Chinese
Remainder Theorem. The property that we need is the simple fact that if $a$ is chosen uniformly at random from
$\mathbb{Z}_n$, then for any $m < n$, $a$ modulo $m$ is uniformly distributed in $\mathbb{Z}_m$. This implies that for a random
evaluation point $a$, the distribution of certain tuples of cell-probes used to answer this evaluation point is close to
uniform. This observation allows us to construct an efficient, error-correcting data structure for polynomial
evaluation. Our construction follows the non-error-correcting one of [15] fairly closely; the main new
ingredient is to add redundancy to their Chinese Remainder-based reconstruction by using more primes, which
gives us the error-correcting features we need.

Time-complexity of decoding and encoding. So far we have used the number of bit-probes as a proxy
for the actual time the decoder needs for query-answering. This is fairly standard, and usually justified
by the fact that the actual time complexity of decoding is not much worse than its number of bit-probes.
This is also the case for our constructions. For MEMBERSHIP, it can be shown that the decoder uses $O(1)$
probes and $\mathrm{polylog}(n)$ time (as do the RLDCs of [4]). For POLYNOMIAL EVALUATION, the decoder uses
$\mathrm{polylog}(s) \log^{1+o(1)}(n)$ probes and $\mathrm{polylog}(sn)$ time.

The efficiency of encoding, i.e., the "pre-processing" of the data into the form of a data structure, for
both our error-correcting data structures for MEMBERSHIP and POLYNOMIAL EVALUATION depends on the
efficiency of encoding of the RLDC constructions in [4]. This is not addressed explicitly there, and needs
further study.

2 The Membership problem 

In this section we construct a data structure for the membership problem $\mathrm{MEM}_{n,s}$. First we describe some
of the building blocks that we need to prove Theorem 1. Our first basic building block is the relaxed locally
decodable code of Ben-Sasson et al. [4] with nearly linear length. Using our terminology, we can restate
their result as follows:

Theorem 3 (BGHSV [4]). For every $\varepsilon \in (0, 1/2)$ and $\eta > 0$, there exist an integer $t > 0$ and reals $c > 0$
and $\tau > 0$, such that for every $n$ and every $\delta \le \tau$, the membership problem $\mathrm{MEM}_{n,n}$ has a
$(t, \delta, \varepsilon, c\delta)$-data structure of length $O(n^{1+\eta})$.

Note that by picking the error-rate $\delta$ a sufficiently small constant, one can set $\lambda = c\delta$ (the fraction of
unrecoverable queries) to be very close to 0.

The other building block that we need is the following one-probe data structure of Buhrman et al. [6].

Theorem 4 (BMRV [6]). For every $\varepsilon \in (0, 1/2)$ and for every positive integers $s \le n$, there is a $(1, 0, \varepsilon, 0)$-data
structure for $\mathrm{MEM}_{n,s}$ of length $m = \frac{100}{\varepsilon^2} s \log n$ bits.

³ We remark that this is different from the notion of smooth decoding in the LDC literature, which requires that for every fixed
query, each bit-probe by itself is chosen with probability close to uniform (though not independent of the other bit-probes).
Properties of the BMRV encoding: The encoding can be represented as a bipartite graph $\mathcal{G} = (L, R, E)$
with $|L| = n$ left vertices and $|R| = m$ right vertices, and regular left degree $d = \frac{\log n}{\varepsilon}$. $\mathcal{G}$ is an expander
graph: for each set $S \subseteq L$ with $|S| \le 2s$, its neighborhood $\Gamma(S)$ satisfies $|\Gamma(S)| \ge (1 - \frac{\varepsilon}{2})|S|d$. For each
assignment of bits to the left vertices with at most $s$ ones, the encoding specifies an assignment of bits to the
right vertices. In other words, each $x \in \{0, 1\}^n$ of weight $|x| \le s$ corresponds to an assignment to the left
vertices, and the $m$-bit encoding of $x$ corresponds to an assignment to the right vertices.

For each $i \in [n]$ we write $\Gamma_i := \Gamma(\{i\})$ to denote the set of neighbors of $i$. A crucial property of the
encoding function $\mathcal{E}_{\mathrm{bmrv}}$ is that for every $x$ of weight $|x| \le s$, for each $i \in [n]$, if $y = \mathcal{E}_{\mathrm{bmrv}}(x) \in \{0, 1\}^m$
then $\Pr_{j \in \Gamma_i}[x_i = y_j] \ge 1 - \varepsilon$. Hence the decoder for this data structure can just probe a random index
$j \in \Gamma_i$ and return the resulting bit $y_j$. Note that this construction is not error-correcting at all, since $|\Gamma_i|$
errors in the data structure suffice to erase all information about the $i$-th bit of the encoded $x$. □

As we mentioned in Section 1.1, by combining the BMRV encoding with the data structure for
$\mathrm{MEM}_{n,n}$ from Theorem 3, one easily obtains an $(O(1), \delta, \varepsilon, O(\delta))$-data structure for $\mathrm{MEM}_{n,s}$ of length
$O((s \log n)^{1+\eta})$. However, we can give an even more efficient, error-correcting data structure of length
$O(s^{1+\eta} \log n)$. Our improvement follows an approach taken in de Wolf [20], which we now describe. For
a vector $x \in \{0, 1\}^n$ with $|x| \le s$, consider a BMRV structure encoding $20n$ bits into $m$ bits. Now, from
Section 2.3 in [20], the following "balls and bins estimate" is known:

Proposition 5 (From [20]). For every positive integers $s \le n$, the BMRV bipartite graph $\mathcal{G} = ([20n], [m], E)$
for $\mathrm{MEM}_{20n,s}$ with error parameter $\frac{1}{10}$ has the following property: there exists a partition of $[m]$ into
$b = 10 \log(20n)$ disjoint sets $B_1, \ldots, B_b$ of $10^3 s$ vertices each, such that for each $i \in [n]$, there are at least
$\frac{b}{4}$ sets $B_k$ satisfying $|\Gamma_i \cap B_k| = 1$.

Proposition 5 suggests the following encoding and decoding procedures. To encode $x$, we rearrange the
$m$ bits of $\mathcal{E}_{\mathrm{bmrv}}(x)$ into $O(\log n)$ disjoint blocks of $O(s)$ bits each, according to the partition guaranteed
by Proposition 5. Then for each block, encode these bits with the error-correcting data structure (RLDC)
from Theorem 3. Given a received word $w$, to decode $i \in [n]$, pick a block $B_k$ at random. With probability
at least $\frac{1}{4}$, $\Gamma_i \cap B_k = \{j\}$ for some $j$. Run the RLDC decoder to decode the $j$-th bit of the $k$-th block of
$w$. Since most blocks don't have much higher error-rate than the average (which is at most $\delta$), with high
probability we recover $\mathcal{E}_{\mathrm{bmrv}}(x)_j$, which equals $x_i$ with high probability. Finally, we will argue that most
queries do not receive a blank symbol $\perp$ as an answer, using the expansion property of the BMRV encoding
structure. We now proceed with a formal proof of Theorem 1.

Proof of Theorem 1. We only construct an error-correcting data structure with error probability 0.49. By
a standard amplification technique we can reduce the error probability to any other positive constant (i.e.,
repeat the decoder $O(\log(1/\varepsilon))$ times).
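The text only says to repeat the decoder. One natural way to combine the runs, assumed here and not taken from the paper, is to output a bit only if it wins a strict majority of the runs and to output ⊥ otherwise. A wrong bit appears in each run with probability below 1/2, so it rarely wins a majority. For good queries the correct bit appears with probability above 1/2, so it usually does. The names `amplify` and `BOTTOM` are hypothetical.

```python
from collections import Counter

BOTTOM = None  # stands for the "don't know" symbol ⊥ (illustrative choice)


def amplify(run_decoder, query, repetitions):
    """Run a randomized decoder several times and take a strict-majority vote.

    run_decoder(query) is assumed to return 0, 1, or BOTTOM, using fresh
    randomness on each call.
    """
    votes = Counter(run_decoder(query) for _ in range(repetitions))
    answer, count = votes.most_common(1)[0]
    if answer is not BOTTOM and 2 * count > repetitions:
        return answer
    return BOTTOM
```

The number of probes grows by the factor `repetitions`, which stays constant for any fixed target error probability.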

By Theorem 4, there exists an encoder $\mathcal{E}_{\mathrm{bmrv}}$ for a $(1, 0, \frac{1}{10}, 0)$-data structure for the membership
problem $\mathrm{MEM}_{20n,s}$ of length $m = 10^4 s \log(20n)$. Let $s' = 10^3 s$. By Theorem 3, for every $\eta > 0$,
for some $t = O(1)$, and sufficiently small $\delta$, $\mathrm{MEM}_{s',s'}$ has a $(t, 10^5\delta, \frac{1}{100}, O(\delta))$-data structure of length
$s'' = O(s'^{1+\eta})$. Let $\mathcal{E}_{\mathrm{bghsv}}$ and $\mathcal{D}_{\mathrm{bghsv}}$ be its encoder and decoder, respectively.

Encoding. Let $B_1, \ldots, B_b$ be a partition of $[m]$ as guaranteed by Proposition 5. For a string $w \in \{0, 1\}^m$,
we abuse notation and write $w = w_{B_1} \cdots w_{B_b}$ to denote the string obtained from $w$ by applying the
permutation on $[m]$ according to the partition $B_1, \ldots, B_b$. In other words, $w_{B_k}$ is the concatenation of $w_i$ where
$i \in B_k$. We now describe the encoding process.

Encoder $\mathcal{E}$: on input $x \in \{0, 1\}^n$, $|x| \le s$,

1. Let $y = \mathcal{E}_{\mathrm{bmrv}}(x 0^{19n})$ and write $y = y_{B_1} \cdots y_{B_b}$.

2. Output the concatenation $\mathcal{E}(x) = \mathcal{E}_{\mathrm{bghsv}}(y_{B_1}) \cdots \mathcal{E}_{\mathrm{bghsv}}(y_{B_b})$.

The length of $\mathcal{E}(x)$ is $N = b \cdot O(s'^{1+\eta}) = O(s^{1+\eta} \log n)$.

Decoding. Given a string $w \in \{0, 1\}^N$, we write $w = w^{(1)} \cdots w^{(b)}$, where for $k \in [b]$, $w^{(k)}$ denotes the
$s''$-bit string $w_{s''(k-1)+1} \cdots w_{s'' k}$.

Decoder $\mathcal{D}$: on input $i$ and with oracle access to a string $w \in \{0, 1\}^N$,

1. Pick a random $k \in [b]$.

2. If $|\Gamma_i \cap B_k| \ne 1$, then output a random bit.

Else, let $\Gamma_i \cap B_k = \{j\}$. Run and output the answer given by the decoder $\mathcal{D}_{\mathrm{bghsv}}(j)$, with oracle
access to the $s''$-bit string $w^{(k)}$.
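The block structure of the encoder and decoder above can be sketched in Python under strong simplifying assumptions. The BMRV encoding `y`, the partition `blocks` (standing in for $B_1, \ldots, B_b$), and the neighbor sets `neighbors[i]` (standing in for $\Gamma_i$) are taken as given inputs. `inner_encode` and `inner_decode` are hypothetical stand-ins for the RLDC encoder and decoder of Theorem 3, which are not implemented here.

```python
import random


def encode_blocks(y, blocks, inner_encode):
    """Encoder step 2: encode each block y_{B_k} separately with the inner code."""
    return [inner_encode([y[j] for j in block]) for block in blocks]


def decode_query(i, received, blocks, neighbors, inner_decode, rng):
    """Decoder: pick a random block; use it only if it meets Gamma_i exactly once."""
    k = rng.randrange(len(blocks))
    hits = [pos for pos, j in enumerate(blocks[k]) if j in neighbors[i]]
    if len(hits) != 1:
        return rng.randrange(2)  # output a random bit
    # Ask the inner decoder for the position of j inside block k of the received word.
    return inner_decode(received[k], hits[0])
```

For example, passing `inner_encode=list` and an `inner_decode` that reads the stored bit directly exercises the control flow, but without any noise tolerance. An actual RLDC decoder may also return ⊥, which this sketch would simply pass through.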

Analysis. Fix $x \in D$ and $w \in \{0, 1\}^N$ such that $\Delta(w, \mathcal{E}(x)) \le \delta N$, where $\delta$ is less than some small
constant $\tau$ to be specified later. We now verify the four conditions of Definition 1. For Condition 1, note
that the number of probes the decoder $\mathcal{D}$ makes is the number of probes the decoder $\mathcal{D}_{\mathrm{bghsv}}$ makes, which
is at most $t$, a fixed integer.

We now examine Condition 2. Fix $i \in [n]$. By Markov's inequality, for a random $k \in [b]$, the probability
that the relative Hamming distance between $\mathcal{E}_{\mathrm{bghsv}}(y_{B_k})$ and $w^{(k)}$ is greater than $10^5\delta$ is at most $10^{-5}$. If $k$ is
chosen such that the fraction of errors in $w^{(k)}$ is at most $10^5\delta$ and $\Gamma_i \cap B_k = \{j\}$, then with probability at
least 0.99, $\mathcal{D}_{\mathrm{bghsv}}$ outputs $y_j$ or $\perp$. Let $\beta \ge \frac{1}{4}$ be the fraction of $k \in [b]$ such that $|\Gamma_i \cap B_k| = 1$. Then

$$\Pr[\mathcal{D}(i) \in \{x_i, \perp\}] \ge (1 - \beta)\frac{1}{2} + \beta\frac{99}{100} - \frac{1}{10^5} > 0.624. \qquad (1)$$
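The Markov step used for Condition 2 can be spelled out as follows, assuming $\delta > 0$ and using that all $b$ blocks of $w$ have the same length $s''$, so that $N = b\, s''$. The symbol $\delta_k$ for the relative error of block $k$ is notation introduced only for this sketch.

```latex
\delta_k := \frac{\Delta\bigl(w^{(k)}, \mathcal{E}_{\mathrm{bghsv}}(y_{B_k})\bigr)}{s''},
\qquad
\mathbb{E}_{k \in [b]}[\delta_k]
  = \frac{1}{b} \sum_{k=1}^{b} \delta_k
  = \frac{\Delta(w, \mathcal{E}(x))}{b\, s''}
  = \frac{\Delta(w, \mathcal{E}(x))}{N}
  \le \delta,
\qquad
\Pr_{k \in [b]}\bigl[\delta_k > 10^5 \delta\bigr]
  \le \frac{\mathbb{E}_{k}[\delta_k]}{10^5 \delta}
  \le 10^{-5}.
```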

To prove Condition 3, we need the expansion property of the BMRV structure, as explained after
Theorem 4. For $k \in [b]$, define $G_k \subseteq B_k$ so that $j \in G_k$ if $\Pr[\mathcal{D}_{\mathrm{bghsv}}^{w^{(k)}}(j) = y_j] \ge 0.99$. In other words, $G_k$
consists of indices in block $B_k$ that are answered correctly by $\mathcal{D}_{\mathrm{bghsv}}$ with high probability. By Theorem 3,
if the fraction of errors in $w^{(k)}$ is at most $10^5\delta$, then $|G_k| \ge (1 - c\delta)|B_k|$ for some fixed constant $c$. Set
$A = \bigcup_{k \in [b]} B_k \setminus G_k$. Since we showed above that for a $(1 - 10^{-5})$-fraction of $k \in [b]$, the fractional number
of errors in $w^{(k)}$ is at most $10^5\delta$, we have $|A| \le c\delta m + 10^{-5} m$.

Recall that the BMRV expander has left degree $d = 10 \log(20n)$. Take $\delta$ small enough that $|A| < \frac{1}{40} sd$;
this determines the value of $\tau$ of the theorem. We need to show that for any such small set $A$, most queries
$i \in [n]$ are answered correctly with probability at least 0.51. It suffices to show that for most $i$, most of the
set $\Gamma_i$ falls outside of $A$. To this end, let $B(A) = \{i \in [n] : |\Gamma_i \cap A| \ge \frac{d}{10}\}$. We show that if $A$ is small then
$B(A)$ is small.

Claim 6. For every $A \subseteq [m]$ with $|A| < \frac{sd}{40}$, it is the case that $|B(A)| < \frac{s}{2}$.

Proof. Suppose, by way of contradiction, that $B(A)$ contains a set $W$ of size $s/2$. $W$ is a set of left vertices
in the underlying expander graph $\mathcal{G}$, and since $|W| < 2s$, we must have

$$|\Gamma(W)| \ge \left(1 - \frac{1}{20}\right) d|W|.$$

By construction, each vertex in $W$ has at most $\frac{9}{10}d$ neighbors outside $A$. Thus, we can bound the size of
$\Gamma(W)$ from above as follows

$$|\Gamma(W)| \le |A| + \frac{9}{10} d|W| < \frac{1}{40} sd + \frac{9}{10} d|W| = \frac{1}{20} d|W| + \frac{9}{10} d|W| = \left(1 - \frac{1}{20}\right) d|W|.$$

This is a contradiction. Hence no such $W$ exists and $|B(A)| < \frac{s}{2}$. □

Define $G = [n] \setminus B(A)$ and notice that $|G| > n - \frac{s}{2}$. It remains to show that each query $i \in G$ is
answered correctly with probability $> 0.51$. To this end, we have

$$\Pr[\mathcal{D}(i) = \perp] \le \Pr[\mathcal{D} \text{ probes a block with noise-rate} > 10^5\delta] + \Pr[\mathcal{D} \text{ probes a } j \in A] + \Pr[\mathcal{D}(i) = \perp : \mathcal{D} \text{ probes a } j \notin A]$$

$$\le \frac{1}{10^5} + \frac{1}{10} + \frac{1}{100} < 0.111.$$

Combining with Eq. (1), for all $i \in G$ we have

$$\Pr[\mathcal{D}(i) = x_i] = \Pr[\mathcal{D}(i) \in \{x_i, \perp\}] - \Pr[\mathcal{D}(i) = \perp] > 0.51.$$

Finally, Condition 4 follows from the corresponding condition of the data structure for $\mathrm{MEM}_{n,n}$. □



3 The POLYNOMIAL EVALUATION problem 

In this section we prove Theorem 2. Given a polynomial $g$ of degree $s$ over $\mathbb{Z}_n$, our goal is to write down
a data structure of length roughly linear in $s \log n$ so that for each $a \in \mathbb{Z}_n$, $g(a)$ can be computed with
approximately $\mathrm{polylog}\, s \cdot \log n$ bit-probes. Our data structure is built on the work of Kedlaya and Umans [15].
Since we cannot quite use their construction as a black-box, we first give a high-level overview of our proof,
motivating each of the proof ingredients that we need.



Encoding based on reduced polynomials: The most naive construction, by recording $g(a)$ for each $a \in \mathbb{Z}_n$,
has length $n \log n$ and answers an evaluation query with $\log n$ bit-probes. As explained in [15], one can
reduce the length by using the Chinese Remainder Theorem (CRT): If $P_1$ is a collection of distinct primes,
then a nonnegative integer $m < \prod_{p \in P_1} p$ is uniquely specified by (and can be reconstructed efficiently from)
the values $[m]_p$ for each $p \in P_1$, where $[m]_p$ denotes $m \bmod p$.
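A minimal Python sketch of CRT reconstruction follows. It is the textbook incremental method, not necessarily the procedure used in [15], and the function name is illustrative. It relies on `pow(x, -1, p)` for modular inverses, which requires Python 3.8 or later.

```python
def crt_reconstruct(residues, moduli):
    """Return the unique m in [0, prod(moduli)) with m % p == r for each (r, p).

    The moduli are assumed to be pairwise coprime (e.g. distinct primes).
    """
    value, modulus = 0, 1
    for r, p in zip(residues, moduli):
        # Choose k so that value + modulus * k is congruent to r modulo p.
        k = ((r - value) * pow(modulus, -1, p)) % p
        value += modulus * k
        modulus *= p
    return value
```

After each step, `value` agrees with every residue processed so far and stays below the running product `modulus`, which is why the final result is the unique representative in the stated range.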

Consider the value $g(a)$ over $\mathbb{Z}$, which can be bounded above by $n^{s+2}$, for $a \in \mathbb{Z}_n$. Let $P_1$ consist of the 
first $\log(n^{s+2})$ primes. For each $p \in P_1$, compute the reduced polynomial $g_p := g \bmod p$ and write down 
$g_p(b)$ for each $b \in \mathbb{Z}_p$. Consider the data structure that simply concatenates the evaluation table of every 
reduced polynomial. This data structure has length $|P_1| (\max_{p \in P_1} p)^{1+o(1)}$, which is $s^{2+o(1)} \log^{2+o(1)} n$ by 
the Prime Number Theorem (see Fact 12 in Appendix B). Note that $g(a) < \prod_{p \in P_1} p$. To compute $[g(a)]_n$, 
it suffices to apply CRT to reconstruct $g(a)$ over $\mathbb{Z}$ from the values $[g(a)]_p = g_p([a]_p)$ for each $p \in P_1$. The 
number of bit-probes is $|P_1| \log(\max_{p \in P_1} p)$, which is $s^{1+o(1)} \log^{1+o(1)} n$. 
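
As an illustration of the naive CRT-based construction just described, here is a minimal Python sketch (ignoring noise entirely). The function names are hypothetical, and as a simplifying assumption the prime set is taken to be the smallest primes whose product exceeds $n^{s+2}$ rather than the first $\log(n^{s+2})$ primes; any set of distinct primes with product larger than $g(a)$ would serve for reconstruction.

```python
from math import prod

def first_primes_with_product_above(bound):
    """Smallest primes 2, 3, 5, ... whose product exceeds `bound` (assumed choice of P_1)."""
    primes, candidate, product = [], 2, 1
    while product <= bound:
        # `primes` holds every prime below `candidate`, so trial division is exact.
        if all(candidate % p for p in primes):
            primes.append(candidate)
            product *= candidate
        candidate += 1
    return primes

def evaluate(coeffs, x, modulus=None):
    """Horner evaluation of sum(coeffs[i] * x**i), optionally reduced mod `modulus`."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * x + c
        if modulus is not None:
            acc %= modulus
    return acc

def crt(residues, primes):
    """Unique 0 <= m < prod(primes) with m % p == r for each (r, p)."""
    total = prod(primes)
    m = 0
    for r, p in zip(residues, primes):
        rest = total // p
        m += r * rest * pow(rest, -1, p)  # modular inverse, Python 3.8+
    return m % total

def build_tables(coeffs, primes):
    """Evaluation table of each reduced polynomial g_p over all of Z_p."""
    return {p: [evaluate(coeffs, b, p) for b in range(p)] for p in primes}

def query(tables, primes, a, n):
    """Look up g_p([a]_p) for every p, rebuild g(a) over Z by CRT, then reduce mod n."""
    residues = [tables[p][a % p] for p in primes]
    return crt(residues, primes) % n

n = 11
coeffs = [3, 0, 7, 10]                      # g(X) = 3 + 7X^2 + 10X^3, so s = 3
primes = first_primes_with_product_above(n ** (len(coeffs) + 1))  # bound n^(s+2)
tables = build_tables(coeffs, primes)
answers = [query(tables, primes, a, n) for a in range(n)]
```

Each query reads one table entry per prime, which mirrors the $|P_1| \log(\max_{p \in P_1} p)$ bit-probe count above.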

Error-correction with reduced polynomials: The above CRT-based construction has terrible parameters, 
but it serves as an important building block from which we can obtain a data structure with better parameters. 
For now, we explain how the above CRT-based encoding can be made error-correcting. One can protect the 
bits of the evaluation tables of each reduced polynomial by an RLDC as provided by Theorem 3. However, 
the evaluation tables can have non-binary alphabets, and a bit-flip in just one "entry" of an evaluation table 
can destroy the decoding process. To remedy this, one can first encode each entry by a standard 
error-correcting code and then encode the concatenation of all the tables by an RLDC. This is encapsulated in 
Lemma 7, which can be viewed as a version of Theorem 3 over a non-binary alphabet. We prove this in 
Appendix A. 

Lemma 7. Let $f : D \times Q \to \{0,1\}^\ell$ be a data structure problem. For every $\varepsilon, \eta, \lambda \in (0,1)$, there exists 
$\tau \in (0,1)$ such that for every $\delta \le \tau$, $f$ has an $(O(\ell), \delta, \varepsilon, \lambda)$-data structure of length $O((\ell |Q|)^{1+\eta})$. 

To apply Lemma 7, let $D$ be the set of degree-$s$ polynomials over $\mathbb{Z}_n$, $Q$ be the set of all evaluation 
points of all the reduced polynomials of $g$ (each specified by a pair $(a, p)$), and the data structure problem $f$ 
outputs evaluations of some reduced polynomial of $g$. 

By itself, Lemma 7 cannot guarantee resilience against noise. In order to apply the CRT to reconstruct 
$g(a)$, all the values $\{[g(a)]_p : p \in P_1\}$ must be correct, which is not guaranteed by Lemma 7. To fix this, we 
add redundancy, taking a larger set of primes than necessary so that the reconstruction via CRT can be made 
error-correcting. Specifically, we apply a Chinese Remainder Code, or CRT code for short, to the encoding 
process. 

Definition 2 (CRT code). Let $p_1 < p_2 < \cdots < p_N$ be distinct primes, $K < N$, and $T = \prod_{i=1}^{K} p_i$. The 
Chinese Remainder Code (CRT code) with basis $p_1, \ldots, p_N$ and rate $K/N$ over message space $\mathbb{Z}_T$ encodes 
$m \in \mathbb{Z}_T$ as $([m]_{p_1}, [m]_{p_2}, \ldots, [m]_{p_N})$. 

Remark. By CRT, for distinct $m_1, m_2 \in \mathbb{Z}_T$, their encodings agree on at most $K - 1$ coordinates. Hence 
the Chinese Remainder Code with basis $p_1 < \ldots < p_N$ and rate $K/N$ has distance $N - K + 1$. 
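
The remark can be checked by brute force on a toy instance. The following Python sketch uses an assumed small basis ($N = 4$, $K = 2$, chosen only for illustration) and hypothetical helper names; it encodes every message in $\mathbb{Z}_T$ and measures the largest agreement between two distinct codewords.

```python
from itertools import combinations
from math import prod

def crt_encode(m, basis):
    """CRT codeword of m: its residues modulo every prime in the basis."""
    return tuple(m % p for p in basis)

def agreement(u, v):
    """Number of coordinates on which two codewords coincide."""
    return sum(x == y for x, y in zip(u, v))

basis = [5, 7, 11, 13]      # p_1 < ... < p_N with N = 4 (illustrative choice)
K = 2
T = prod(basis[:K])         # message space Z_T with T = p_1 * p_2 = 35

codewords = [crt_encode(m, basis) for m in range(T)]
max_agreement = max(agreement(u, v) for u, v in combinations(codewords, 2))
min_distance = len(basis) - max_agreement
```

Two distinct messages differ by less than $T$, so their difference cannot be divisible by $K$ of the basis primes; this is what caps the agreement at $K - 1$.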

It is known that good families of CRT codes exist and that unique decoding algorithms for CRT codes 
(see e.g., [12]) can correct up to almost half of the distance of the code. The following statement can be 
easily derived from known facts, and we include a proof in Appendix B. 

Theorem 8. For every positive integer $T$, there exists a set $P$ consisting of distinct primes, with (1) $|P| = O(\log T)$, 
and (2) $\forall p \in P$, $\log T < p < 500 \log T$, such that a CRT code with basis $P$ and message space 
$\mathbb{Z}_T$ has rate $\frac{1}{2}$, and can correct up to a $\left(\frac{1}{4} - O\!\left(\frac{1}{\log \log T}\right)\right)$-fraction of errors. 

We apply Theorem 8 to a message space of size $n^{s+2}$ to obtain a set of primes $P_1$ with the properties 
described above. Note that these primes are all within a constant factor of one another, and in particular, 
the evaluation table of each reduced polynomial has the same length, up to a constant factor. This fact and 
Lemma 7 will ensure that our CRT-based encoding is error-correcting. 

Reducing the bit-probe complexity: We now explain how to reduce the bit-probe complexity of the 
CRT-based encoding, using an idea from [15]. Write $s = d^m$, where $d = \log^C s$, $m = \frac{\log s}{C \log \log s}$, and 
$C > 1$ is a sufficiently large constant. Consider the following multilinear extension map $\psi_{d,m} : \mathbb{Z}_n[X] \to 
\mathbb{Z}_n[X_0, \ldots, X_{m-1}]$ that sends a univariate polynomial of degree at most $s$ to an $m$-variate polynomial of 
degree less than $d$ in each variable. For every $i \in [s]$, write $i = \sum_{j=0}^{m-1} i_j d^j$ in base $d$. Define $\psi_{d,m}$, which 
sends $X^i$ to $X_0^{i_0} \cdots X_{m-1}^{i_{m-1}}$ and extends multilinearly to $\mathbb{Z}_n[X]$. 

To simplify our notation, we write $\tilde{g}$ to denote the multivariate polynomial $\psi_{d,m}(g)$. For every $a \in \mathbb{Z}_n$, 
define $\tilde{a} \in \mathbb{Z}_n^m$ to be $([a]_n, [a^d]_n, [a^{d^2}]_n, \ldots, [a^{d^{m-1}}]_n)$. Note that for every $a \in \mathbb{Z}_n$, $\tilde{g}(\tilde{a}) = g(a) \pmod{n}$. 
Now the trick is to observe that the total degree of the multilinear polynomial $\tilde{g}$ is less than the degree of the 
univariate polynomial $g$, and hence its maximal value over the integers is much reduced. In particular, for 
every $a \in \mathbb{Z}_n$, the value $\psi_{d,m}(g)(\tilde{a})$ over the integers is bounded above by $d^m n^{dm+1}$. 
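
To make the map $\psi_{d,m}$ and the lifted point $\tilde{a}$ concrete, here is a small Python sketch. The representation (a dictionary from exponent tuples to coefficients) and all function names are assumptions made for illustration; the parameters in the usage lines are toy values and do not follow the asymptotic choice of $d$ and $m$.

```python
def digits(i, d, m):
    """Base-d digits (i_0, ..., i_{m-1}) of i, least significant first."""
    out = []
    for _ in range(m):
        out.append(i % d)
        i //= d
    return tuple(out)

def psi(coeffs, d, m):
    """Send sum_i c_i X^i to the m-variate polynomial with X^i -> prod_j X_j^{i_j}.

    The result is a dict {exponent tuple: coefficient}."""
    assert len(coeffs) <= d ** m, "degree must be below d^m"
    return {digits(i, d, m): c for i, c in enumerate(coeffs) if c}

def lifted_point(a, d, m, n):
    """The point ([a]_n, [a^d]_n, ..., [a^(d^(m-1))]_n)."""
    return tuple(pow(a, d ** j, n) for j in range(m))

def eval_multivariate(poly, point):
    """Evaluate the polynomial over the integers (no modular reduction)."""
    total = 0
    for exps, c in poly.items():
        term = c
        for x, e in zip(point, exps):
            term *= x ** e
        total += term
    return total

n, d, m = 13, 3, 2
coeffs = [1, 2, 3, 4, 5, 6, 7, 8, 9]        # degree 8 < d^m = 9
g_tilde = psi(coeffs, d, m)
values = [eval_multivariate(g_tilde, lifted_point(a, d, m, n)) for a in range(n)]
```

Reducing each entry of `values` modulo `n` recovers $g(a) \bmod n$, while the integer values themselves stay below $d^m n^{dm+1}$, far smaller than the corresponding bound for the univariate $g$.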

We now work with the reduced polynomials of $\tilde{g}$ for our encoding. Let $P_1$ be the collection of primes 
guaranteed by Theorem 8 when $T_1 = d^m n^{dm+1}$. For $p \in P_1$, let $\tilde{g}_p$ denote $\tilde{g} \bmod p$ and $\tilde{a}_p$ denote the 
point $([a]_p, [a^d]_p, \ldots, [a^{d^{m-1}}]_p)$. Consider the data structure that concatenates the evaluation table of $\tilde{g}_p$ for 
each $p \in P_1$. For each $a \in \mathbb{Z}_n$, to compute $g(a)$, it suffices to compute $\tilde{g}(\tilde{a})$ over $\mathbb{Z}$, which by Theorem 8 
can be reconstructed (even with noise) from the set $\{\tilde{g}_p(\tilde{a}_p) : p \in P_1\}$. 

Since the maximum value of $\tilde{g}$ is at most $T_1 = d^m n^{dm+1}$ (whereas the maximum value of $g$ is at 
most $d^m n^{d^m+1}$), the number of primes we now use is significantly less. This effectively reduces the 
bit-probe complexity. In particular, each evaluation query can be answered with $|P_1| \cdot \max_{p \in P_1} \log p = 
(dm \log n)^{1+o(1)}$ bit-probes, which by our choice of $d$ and $m$ is equal to $\mathrm{polylog}\, s \cdot \log^{1+o(1)} n$. However, 
the length of this encoding is still far from the information-theoretically optimal $s \log n$ bits. We shall 
explain how to reduce the length, but since encoding with multilinear reduced polynomials introduces potential 
complications in error-correction, we first explain how to circumvent these complications. 

Error-correction with reduced multivariate polynomials: There are two complications that arise from 
encoding with reduced multivariate polynomials. The first is that not all the points in the evaluation tables 
are used in the reconstructive CRT algorithm. Lemma 7 only guarantees that most of the entries of the table 
can be decoded, not all of them. So if the entries that are used in the reconstruction via CRT are not decoded 
by Lemma 7, then the whole decoding procedure fails. 

More specifically, to reconstruct $\tilde{g}(\tilde{a})$ over $\mathbb{Z}_n$, it suffices to query the point $\tilde{a}_p$ in the evaluation table 
of $\tilde{g}_p$ for each $p \in P_1$. Typically the set $\{\tilde{a}_p : a \in \mathbb{Z}_n\}$ will be much smaller than $\mathbb{Z}_p^m$, so not all the 
points in $\mathbb{Z}_p^m$ are used. To circumvent this issue, we only store the query points that are used in the CRT 
reconstruction. Let $B^p = \{\tilde{a}_p : a \in \mathbb{Z}_n\}$. For each $p \in P_1$, the encoding only stores the evaluation of $\tilde{g}_p$ at 
the points $B^p$ instead of the entire domain $\mathbb{Z}_p^m$. The disadvantage of computing the evaluation at the points 
in $B^p$ is that the encoding stage takes time proportional to $n$. We thus give up on encoding efficiency (which 
was one of the main goals of Kedlaya and Umans) in order to guarantee error-correction. 

The second complication is that the sizes of the evaluation tables may no longer be within a constant 
factor of each other. (This is true even if the evaluation points come from all of $\mathbb{Z}_p^m$.) If one of the tables 
has length significantly longer than the others, then a constant fraction of noise may completely corrupt the 
entries of all the other small tables, rendering decoding via CRT impossible. This potential problem is easy 
to fix; we apply a repetition code to each evaluation table so that all the tables have equal length. 

Reducing the length: Now we explain how to reduce the length of the data structure to nearly $s \log n$, 
along the lines of Kedlaya and Umans [15]. To reduce the length, we need to reduce the magnitude of 
the primes used by the CRT reconstruction. We can effectively achieve that by applying the CRT twice. 
Instead of storing the evaluation table of $\tilde{g}_p$, we apply CRT again and store evaluation tables of the reduced 
polynomials of $\tilde{g}_p$ instead. Whenever an entry of $\tilde{g}_p$ is needed, we can apply the CRT reconstruction to the 
reduced polynomials of $\tilde{g}_p$. 

Note that for $p_1 \in P_1$, the maximum value of $\tilde{g}_{p_1}$ (over the integers rather than mod $n$) is at most 
$T_2 = d^m p_1^{dm+1}$. Now apply Theorem 8 with $T_2$ the size of the message space to obtain a collection of primes 
$P_2$. Recall that each $p_1 \in P_1$ is at most $O(dm \log n)$. So each $p_2 \in P_2$ is at most $O((dm)^{1+o(1)} \log \log n)$, 
which also bounds the cardinality of $P_2$ from above. 
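
The double application of CRT can be sketched as follows, in the noiseless setting only. This is an illustrative Python sketch under several assumptions: the prime sets are simply the smallest primes whose product exceeds $T_1$ and $T_2$ (not the redundant sets of Theorem 8), the "table lookups" are computed on the fly, and all names and toy parameters are hypothetical.

```python
from math import prod

def primes_with_product_above(bound):
    """Smallest primes whose product exceeds `bound` (assumed stand-in for Theorem 8)."""
    chosen, candidate, product = [], 2, 1
    while product <= bound:
        if all(candidate % q for q in range(2, int(candidate ** 0.5) + 1)):
            chosen.append(candidate)
            product *= candidate
        candidate += 1
    return chosen

def crt(residues, primes):
    """Unique 0 <= v < prod(primes) matching the given residues."""
    total = prod(primes)
    v = sum(r * (total // p) * pow(total // p, -1, p) for r, p in zip(residues, primes))
    return v % total

def eval_mod(poly, point, modulus):
    """Evaluate {exponents: coeff} at `point` with all arithmetic mod `modulus`."""
    total = 0
    for exps, c in poly.items():
        term = c % modulus
        for x, e in zip(point, exps):
            term = term * pow(x, e, modulus) % modulus
        total = (total + term) % modulus
    return total

def query(poly, a, n, d, m, P1, P2):
    """Recover g(a) mod n from doubly reduced evaluations of the multivariate polynomial."""
    lifted = [pow(a, d ** j, n) for j in range(m)]
    level1 = []
    for p1 in P1:
        poly_p1 = {e: c % p1 for e, c in poly.items()}
        point_p1 = [x % p1 for x in lifted]
        # In the data structure these would be stored table entries, one per p2.
        level2 = [eval_mod(poly_p1, [x % p2 for x in point_p1], p2) for p2 in P2]
        # Inner CRT: value of the p1-reduced polynomial over the integers, then mod p1.
        level1.append(crt(level2, P2) % p1)
    # Outer CRT: value of the multivariate polynomial over the integers, then mod n.
    return crt(level1, P1) % n

n, d, m = 13, 2, 2
# Image of g(X) = 5 + 12X + 7X^2 + 3X^3 under X^i -> X_0^{i_0} X_1^{i_1}, i = i_0 + 2 i_1.
poly = {(0, 0): 5, (1, 0): 12, (0, 1): 7, (1, 1): 3}
P1 = primes_with_product_above(d ** m * n ** (d * m + 1))        # T_1
P2 = primes_with_product_above(d ** m * max(P1) ** (d * m + 1))  # T_2 for the largest p_1
answers = [query(poly, a, n, d, m, P1, P2) for a in range(n)]
```

Only residues modulo the small primes in $P_2$ are ever stored, which is the source of the length savings; the price is one inner CRT reconstruction per prime in $P_1$.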

For each query, the number of bit-probes made is at most $|P_1||P_2| \max_{p_2 \in P_2} \log p_2$, which is at most 
$(dm)^{2+o(1)} \log^{1+o(1)} n$. Recall that by our choice $d = \log^C s$ and $m = \frac{\log s}{C \log \log s}$, we have $dm = \frac{\log^{C+1} s}{C \log \log s}$. 
Thus, the bit-probe complexity is $\mathrm{polylog}\, s \cdot \log^{1+o(1)} n$. 

Next we bound the length of the encoding. Recall that by the remark following Theorem |2l we may 
assume without loss of generality that s = 0(log^?T.) for some < ^ < 1. This implies log log n = 
0(log s). Then for each p2 G P2, 

P2 < {O log log n))"" < • s^+°^^^ < si+^+°{i). 

Now, by Lemma|2l the length of the encoding is nearly linear in |Pi||P2| maXp26P2 P2^ logP2> which is at 
most polylog s ■ log^^"*^^^ n ■ maXpjgPa P™- Putting everything together, the length of the encoding is nearly 
linear in s log n. We now proceed with a formal proof. 

Proof of Theorem^ We only construct an en^or-con-ecting data structure with error probability e = |. By 
a standard amplification technique (i.e., 0(log(l/e)) repetitions) we can reduce the error probability to any 
other positive constant. We now give a formal description of the encoding and decoding algorithms. 

Encoding: Apply Theorem 8 with $T = d^m n^{dm+1}$ to obtain a collection of primes $P_1$. Apply Theorem 8 
with $T = d^m (\max_{p \in P_1} p)^{dm+1}$ to obtain a collection of primes $P_2$. Set $p_{\max} = \max_{p_2 \in P_2} p_2$. 

Now, for each $p_1 \in P_1$, $p_2 \in P_2$, define a collection of evaluation points $B^{p_1,p_2} = \{\tilde{a}_{p_1,p_2} : a \in \mathbb{Z}_n\}$. 
Fix a univariate polynomial $g \in \mathbb{Z}_n[X]$ of degree at most $s$. For every $p_1 \in P_1$, $p_2 \in P_2$, view each 
evaluation of the reduced multivariate polynomial $\tilde{g}_{p_1,p_2}$ as a bit-string of length exactly $\lceil \log p_{\max} \rceil$. Let 
$L = \max_{p_1 \in P_1, p_2 \in P_2} |B^{p_1,p_2}|$ and for each $p_1 \in P_1$, $p_2 \in P_2$, set $r^{p_1,p_2} = \left\lceil \frac{L}{|B^{p_1,p_2}|} \right\rceil$. Define $f^{p_1,p_2}$ to be 
the concatenation of $r^{p_1,p_2}$ copies of the string $(\tilde{g}_{p_1,p_2}(q))_{q \in B^{p_1,p_2}}$. Define the string $f = (f^{p_1,p_2})_{p_1 \in P_1, p_2 \in P_2}$. 

We want to apply Lemma 7 to protect the string $f$, which we can since $f$ may be viewed as a data 
structure problem, as follows. The set of data-items is the set of polynomials $g$ as above. The set of queries 
$Q$ is $\bigcup_{p_1 \in P_1, p_2 \in P_2} B^{p_1,p_2} \times [r^{p_1,p_2}]$. The answer to query $(q^{p_1,p_2}, i^{p_1,p_2})$ is the $i^{p_1,p_2}$-th copy of $\tilde{g}_{p_1,p_2}(q^{p_1,p_2})$. 

Fix A G (0, 1). By Lemma|7j for every r] > 0, there exists tq G (0, 1) such that for every 6 < tq, the 
data structure problem corresponding to / has a {0(logpmax), 2~^°, A^2~^^)-data structure. Let <So,^o 
be its encoder and decoder, respectively. Finally, the encoding of the polynomial g is simply 



$$\mathcal{E}(g) = \mathcal{E}_0(f).$$

Note that the length of $\mathcal{E}(g)$ is at most $(|P_1||P_2| \max_{p_2 \in P_2} p_2^m \log p_2)^{1+\eta}$, which as we computed earlier 
is bounded above by $O((s \log n)^{1+\zeta})$ for some arbitrarily small constant $\zeta$. 

Decoding: We may assume, without loss of generality, that the CRT decoder Vert from Theorem|7]outputs 
± when more than a ^^^-fraction of its inputs are erasures (i.e., ± symbols). 

The decoder $\mathcal{D}$, with input $a \in \mathbb{Z}_n$ and oracle access to $w$, does the following: 

1. Compute $\tilde{a} = (a, a^d, \ldots, a^{d^{m-1}}) \in \mathbb{Z}_n^m$, and for every $p_1 \in P_1$, $p_2 \in P_2$, compute the reduced 
evaluation points $\tilde{a}_{p_1,p_2}$. 

2. For every $p_1 \in P_1$, $p_2 \in P_2$, pick $j \in [r^{p_1,p_2}]$ uniformly at random and run the decoder $\mathcal{D}_0$ with oracle 
access to $w$ to obtain the answers $v^{(a)}_{p_1,p_2} = \mathcal{D}_0(\tilde{a}_{p_1,p_2}, j)$. 

3. For every $p_1 \in P_1$ obtain $v^{(a)}_{p_1} = \mathcal{D}_{crt}\big((v^{(a)}_{p_1,p_2})_{p_2 \in P_2}\big)$. 

4. Output $v^{(a)} = \mathcal{D}_{crt}\big((v^{(a)}_{p_1})_{p_1 \in P_1}\big)$. 



Analysis: Fix a polynomial $g$ with degree at most $s$. Fix a bit-string $w$ at relative Hamming distance at 
most $\delta$ from $\mathcal{E}(g)$, where $\delta$ is at most $\tau_0$. We proceed to verify that the above encoding and decoding satisfy 
the conditions of Definition 1. 

Conditions 1 and 4 are easily verified. For Condition 1, observe that for each $p_1 \in P_1$, $p_2 \in P_2$, $\mathcal{D}_0$ 
makes at most $O(\log p_{\max})$ bit-probes. So $\mathcal{D}$ makes at most $O(|P_1||P_2| \log p_{\max})$ bit-probes, which as we 
calculated earlier is at most $\mathrm{polylog}\, s \cdot \log^{1+o(1)} n$. 

For Condition 4, note that since $\mathcal{D}_0$ decodes correctly when no noise is present, $v^{(a)}_{p_1,p_2}$ is equal to 
$\tilde{g}_{p_1,p_2}(\tilde{a}_{p_1,p_2})$. By our choice of $P_1$ and $P_2$, after two applications of the Chinese Remainder Theorem, 
it is easy to see that $\mathcal{D}$ outputs $v^{(a)} = \tilde{g}(\tilde{a})$, which equals $g(a)$. 

Now we verify Condition 2. Fix a G Z„. We want to show that with oracle access to w, with probability 
at least |, the decoder V on input a outputs either g{a) or _L. For vr G Pi U (Pi x P2), we say that a point 

v^'^ is incorrect ifv'^'^ ^ {9^(0,7), -L}. 

By Lemma|7J for each pi G Pi and p2 G P2, ^'pi^p2 incon^ect with probability at most 2^^^ . Now fix 
Pi G Pi. On expectation (over the decoder's randomness), at most a 2~^'^-fraction of the points in the set 
{^pi!p2 • P2 G ^^2} are incorrect. By Markov's inequality, with probability at least 1 — 2~^, the fraction of 
points in the set {v^^^p^ '■ P2 ^ P2} that are incorrect is at most j^. If the fraction of blank symbols in the 
set {t'pi^p2}p2eP2 is at least then Vert outputs _L, which is acceptable. Otherwise, the fraction of eiTors 
and erasures (i.e., _L symbols) in the set {v'^^^p^ : P2 G P2} is at most |. By Theorem [8j the decoder Dcrt 
will output an incorrect Wp"^ with probability at most 2~^. Thus, on expectation, at most a 2~^-fraction of 
the points in {v^^ : pi G Pi} are incorrect. By Mai^kov's inequality again, with probability at least |, at 

most a ^-fraction of the points in {v^^ : pi G Pi} ai^e incorrect, which by Theorem [8] implies that is 
either _L or g{a). This establishes Condition 2. 

We now proceed to prove Condition 3. We show the existence of a set G C Z„ such that |G| > (1 — A)n 
and for each a G G, we have Pr[!D(a) = g{a)] > |. Our proof relies on the following observation: for any 
pi G Pi and p2 G P2, if a G Z„ is chosen uniformly at random, then the evaluation point api,p2 is like a 
uniformly chosen element q G B^^ '^^ . This observation implies that if a few entries in the evaluation tables 
of the multivariate reduced polynomials are corrupted, then for most a G the output of the decoder V 
on input a remains unaffected. We now formalize this observation. 

Claim 9. Fix $p_1 \in P_1$, $p_2 \in P_2$, and a point $q \in B^{p_1,p_2}$. Then 

$$\Pr_{a \in \mathbb{Z}_n}\left[\tilde{a}_{p_1,p_2} = q\right] \le \frac{4}{p_2}.$$

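Proof. For any pair of positive integers $m \le n$, the number of integers in $[n]$ congruent to a fixed integer 
mod $m$ is at most $\lfloor n/m \rfloor + 1$ and at least $\lfloor n/m \rfloor - 1$. Note that if $a, b \in \mathbb{Z}_n$ with $a \equiv b \bmod m$, then for any 
integer $i$, $a^i \equiv b^i \bmod m$. Thus, $\tilde{a}_m = \tilde{b}_m$. 

It is not hard to see that for a fixed $q_1 \in B^{p_1}$, the number of integers $a \in \mathbb{Z}_n$ such that $\tilde{a}_{p_1} = q_1$ is at 
most $\lfloor n/p_1 \rfloor + 1$. Furthermore, for a fixed $q_2 \in B^{p_1,p_2}$, the number of points in $B^{p_1}$ that are congruent to $q_2$ 
mod $p_2$ is at most $\lfloor p_1/p_2 \rfloor + 1$. Thus, for a fixed $q \in B^{p_1,p_2}$, the number of integers $a \in \mathbb{Z}_n$ such that $\tilde{a}_{p_1,p_2} = q$ 
is at most $\left(\lfloor n/p_1 \rfloor + 1\right)\left(\lfloor p_1/p_2 \rfloor + 1\right)$, which is at most $\frac{4n}{p_2}$ since $n > p_1 > p_2$. □ 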

Now, for every $p_1 \in P_1$ and $p_2 \in P_2$, we say that a query $(q, j) \in B^{p_1,p_2} \times [r^{p_1,p_2}]$ is bad if the 
probability that $\mathcal{D}_0(q, j) \ne \tilde{g}_{p_1,p_2}(q)$ is greater than $2^{-10}$. By Lemma 7, the fraction of bad queries in 
Upi,p2-^'^^'^^ X [rPi'P2] is at most Aq := A^2~'^^. We say that a tuple of primes (^1,^2) G -Pi x P2 is bad if 
more than a 2^^AoA^^-fraction of queries in pPi'P^ x [r^i'P^] are bad (below, good always denotes not bad.) 
By averaging, the fraction of bad tuples (pi,P2) is at most 2~^^A. 

For a fixed good tuple {pi,P2), we say that an index '^^ is bad if more than a 2^^^A-fraction of queries 
in the copy B^^'P'^ x {iP^'P^} are bad. Since (pi,P2) is good, by averaging, at most a 2^^AoA~^-fraction 
of [r^'i'P2] are bad. Recall that in Step 2 of the decoder V, the indices {JP^'P^ : pi g Pi,P2 G P2} are 
chosen uniformly at random. So on expectation, the set of indices {jPi P^ ■ (^1,^2) is good} has at most a 
2^^AoA~^-fraction of bad indices. By Markov's inequality, with probability at least |, the fraction of bad 
indices in the set {JP^^p^ {pi,p2) is good} is at most 2^^AoA~^. We condition on this event occumng and 
fix the indices JP^^p^ for each pi G Pi, P2 G ^'2- 

Fix a good tuple (pi,P2) and a good index jP^^P^. By Claim |9l for a uniformly random a G Z„, 
the query {ap^,p2, j^^'^^) is bad with probability at most 2~^A. By linearity of expectation, for a random 
a G Zn, the expected fraction of bad queries in the set S"- = {(opj^pj' J^^'^^) ^ Pi ^ A1P2 G -P2} is at most 
2~^^A + 2^^AoA~^ + 2~^A, which is at most 2~^A by definition of Aq. Thus, by Markov's inequality, for a 
random a G with probability at least 1 — A, the fraction of bad queries in the set 5" is at most 2~®. By 
linearity of expectation, there exists some subset G C Z„ with |G| > (1 — A)n such that for every a G G, 
the fraction of bad queries in 5" is at most 2"*^. 

Now fix a G G. By definition, the fraction of bad queries in S"- is at most 2"*^, and furthermore, each of 
the good queries in 5" is incon^ect with probability at most 2^^*^. So on expectation, the fraction of errors 
and erasures in S"" is at most 2^*^ + 2"^". By Markov's inequality, with probability at least |, the fraction 

of errors and erasures in the set {t^pi^p2 '■ Pi ^ PijP2 ^ P2} is at most 2~^ + 2~^, which is at most ^. We 
condition on this event occurring. By averaging, for more than a |-fraction of the primes pi G Pi, the set 

{^pi!p2 • P2 G ^2} has at most ^-fraction of eiTors and erasures, which can be coiTccted by the CRT decoder 
Vert- Thus, after Step 3 of the decoder T>, the set {vpl^} has at most a ^-fraction of errors and erasures, 
which again will be corrected by the CRT decoder V^rt- Hence, by the union bound, the two events that we 
conditioned on earlier occur simultaneously with probability at least |, and 'C(a) will output g{a). □ 

4 Conclusion and future work 

We presented a relaxation of the notion of error-correcting data structures recently proposed in [20]. While 
the earlier definition does not allow data structures that are both error-correcting and efficient in time and 
space (unless an unexpected breakthrough happens for constant-probe LDCs), our new definition allows 
us to construct efficient, error-correcting data structures for both the MEMBERSHIP and the POLYNOMIAL 
EVALUATION problems. This opens up many directions: what other data structures can be made error- 
correcting? 

The problem of computing rank within a sparse ordered set is a good target. Suppose we are given a 
universe $[n]$, some nonnegative integer $s \le n$, and a subset $S \subseteq [n]$ of size at most $s$. The rank problem is 
to store $S$ compactly so that on input $i \in [n]$, the value $|\{j \in S : j \le i\}|$ can be computed efficiently. For 
easy information-theoretic reasons, any data structure for this problem needs length at least $\Omega(s \log n)$ and 
makes $\Omega(\log s)$ bit-probes for each query. If $s = O(\log n)$, one can trivially obtain an error-correcting data 
structure of optimal length $O(s \log n)$ with $O(\log^2 n)$ bit-probes, which is only quadratically worse than 
optimal: write down $S$ as a string of $s \log n$ bits, encode it with a good error-correcting code, and read the 
entire encoding when an index is queried. However, it may be possible to do something smarter and more 
involved. We leave the construction of near-optimal error-correcting data structures for rank with small $s$ 
(as well as for related problems such as predecessor) as challenging open problems. 

A Non-binary answer set 

We prove Lemma 7, a version of Theorem 3 when the answer set $A$ is non-binary. We first encode the 
$\ell|Q|$-bit string $(f(x, q))_{q \in Q}$ by an RLDC, and use the decoder of the RLDC to recover each of the $\ell$ bits 
of $f(x, q)$. Now it is possible that for each $q \in Q$, the decoder outputs some blank symbols $\perp$ for some 
of the bits of $f(x, q)$, and no query could be answered correctly. To circumvent this, we first encode each 
$\ell$-bit string $f(x, q)$ with a good error-correcting code, then encode the entire string by the RLDC. Now if 
the decoder does not output too many errors or blank symbols among the bits of the error-correcting code 
for $f(x, q)$, we can recover it. We need a family of error-correcting codes with the following property, see 
e.g. page 668 in [19]. 

Fact 10. For every $\delta \in (0, 1/2)$ there exists $R \in (0, 1)$ such that for all $n$, there exists a binary linear 
code of block length $n$, information length $Rn$, Hamming distance $\delta n$, such that the code can correct from 
$e$ errors and $s$ erasures, as long as $2e + s < \delta n$. 

Proof of Lemma^ We only construct an error-correcting data structure with error probability e = |. By a 
standard amplification technique (i.e., 0(log(l/e)) repetitions) we can reduce the error probability to any 
other positive constant. Let S^cc '• {0, 1}^ — )■ {0, 1}^ be an asymptotically good binary error-correcting 
code (from FactfTOl). with i' = 0{i) and relative distance |, and decoder Pecc- By Theorem[3j there exist 
Co, To > such that for every S < tq, there is a (0(1), 6, cq 5) -relaxed locally decodable code (RLDC). 
Let £q and Vq denote its encoder and decoder, respectively. 

Encoding. We construct a data structure for $f$ as follows. Define the encoder $\mathcal{E} : D \to \{0,1\}^N$, where 
$N = O((\ell' \cdot |Q|)^{1+\eta})$, as 

$$\mathcal{E}(x) = \mathcal{E}_0\Big(\big(\mathcal{E}_{ecc}(f(x, q))\big)_{q \in Q}\Big).$$

Decoding. Without loss of generality, we may impose an ordering on the set $Q$ and identify each $q \in Q$ 
with an integer in $[|Q|]$. 

The decoder $\mathcal{D}$, with input $q \in Q$ and oracle access to $w \in \{0,1\}^N$, does the following: 

1. For each $j \in [\ell']$, let $r_j = \mathcal{D}_0^w((q - 1)\ell' + j)$ and set $r = r_1 \ldots r_{\ell'} \in \{0, 1, \perp\}^{\ell'}$. 

2. If the number of blank symbols ± in r is at least ^, then output ±. Else, output Veccir)- 

Analysis. Fix $x \in D$ and $w \in \{0,1\}^N$ such that $\Delta(w, \mathcal{E}(x)) \le \delta N$, and $\delta \le \tau$, where $\tau$ is the minimum 
of $\tau_0$ and $\lambda 2^{-6} c_0^{-1}$. We need to argue that the above encoding and decoding satisfies the four conditions 
of Definition 1. For Condition 1, since $\mathcal{D}_0$ makes $O(1)$ bit-probes and $\mathcal{D}$ runs this $\ell'$ times, $\mathcal{D}$ makes 
$O(\ell') = O(\ell)$ bit-probes into $w$. 

We now show V satisfies Condition 2. Fix q £ Q. We want to show Fr[T>^{q) G {f{x,q), _L}] > |. 
By Theorem[3l for each j G [i'], with probability at most ^, rj = /(x, q)j ® 1. So on expectation, for at 
most a ^-fraction of the indices j, rj = f{x, q)j © 1. By Mai^kov's inequality, with probability at least |, 
the number of indices j such that rj = f{x, q)j © 1 is at most j. If the number of _L symbols in r is at least 
^ then V outputs _L, so assume the number of _L symbols is less than |-. Those _L's are viewed as erasures 
in the codeword £ecc{f{x, q))- Since £ecc has relative distance |, by Fact[lOl Pecc will correct these errors 
and erasures and output f{x, q). 

For Condition 3, we show there exists a large subset G of g's satisfying FT['D^{q) = f{x, q)] > |. Let 
y = {£ecc{f{x, q)) )q(zQ, which is a ^'|(5|-bit string. Call an index i in y bad if Pr[DQ'(z) = yi] < |. By 
Theorem |3l at most a co5-fraction of the indices in y are bad. We say that a query g G Q is bad if more than 
a ^-fraction of the bits in £ecc{f{x, q)) are bad. By averaging, the fraction of bad queries in Q is at most 
64co(5, which is at most A by our choice of r. We define G to be the set of q £ Q that are not bad. Clearly 

|G|>(i-A)|g|. 

Fix q £ G. On expectation (over the decoder's randomness), the fraction of indices in r such that 
rj ^ f{x, q)j is at most ^ + g^- Hence by Markov's inequality, with probability at least |, the fraction of 
indices in r such that rj ^ f{x, q)j is at most Thus, by Fact[lOl Veccif) will recover from these errors 
and erasures and output f{x,q). 

Finally, Condition 4 follows since the pair {£q, Vq) satisfies Condition 4, finishing the proof. □ 

B CRT codes 

In this section we explain how Theorem 8 follows from known facts. In [12], Goldreich, Ron, and Sudan 
designed a unique decoding algorithm for CRT codes. 

Theorem 11 (from [12]). Given a CRT code with basis $p_1 < \ldots < p_N$ and rate $K/N$, there exists a 
polynomial-time algorithm that can correct up to $\frac{\log p_1}{\log p_1 + \log p_N}(N - K)$ errors. 

By choosing the primes appropriately, we can establish Theorem 8. In particular, the following 
well-known estimate, essentially a consequence of the Prime Number Theorem, is useful. See for instance 
Theorem 4.7 in [1] for more details. 

Fact 12. For an integer £ > 0, the £th prime (denoted qe) satisfies lilogi < qi < 13£log£ 
Proof of Theorem^ Let K = Li^^rJ and denote the £-th prime. ByFactfTUg^ > ^KlogK > logT 

and qsK-i < 39K log 3K < 500 log T. Also, notice that lli=K^ Qi > Qk > (log ^) ^^^^^ = ^- Thus, by 
Definition 121 the CRT code with basis qk < ■ ■ ■ < q2K-i < ■ ■ ■ < qsK-i and message space Zt, has rate 
at most = \- Lastly, by Theorem [TT] the code can correct a fraction ^ — 0( ^^ ,y ) of errors. □ 